Objectives and Scope


Course Goals

Introduction to basic statistical terms

Provide a structured overview of statistical concepts used during application of EXL DA methodology 
Explain interpretation and basis of statistical distributions and hypothesis testing 
Provide helpful “tricks of the trade” 


Beyond the Scope of this Training 

Comprehensive coaching on Statistics 

Derivation of statistical formulas or terms (unless required as part of methodology explanation) 


Self Study Goals 

In-depth research on advanced Statistical concepts
Innovations and new techniques related to methodology
Discussion on advanced concepts can be taken up offline 
Chapter 1: Dataset Basics
1.1 Data Set Dimensions



1.1.1. Data Set

A data set contains information in a set of rows and columns

■ Row -> Observation
■ Column -> Variable

1.1.2. Observation 

Observations are the objects described by a set of data. They may be anything. 

A few examples:

Individuals (like customers of a bank, employees of a company, students of a class) 

Transactions 
Accounts 
Zip Codes 


1.1.3. Variable 

A variable is any characteristic of an observation and can take different values for different observations. 
A few examples:

A customer’s tenure or an employee’s salary or a student’s marks 
A transaction’s date, time or amount 
An account’s balance or spending limit 
A zip code’s population 
1.2 Variable Type



1.2.1 Categorical Variable 

A categorical variable places each observation into one of several groups or categories, like the gender variable
with male and female categories

Such variables are best represented by a Pie Chart or a Bar Graph

[Figure: Gender Distribution shown as a pie chart and as a bar graph, with categories Male and Female; one category is labeled 60%]


1.2.2. Quantitative Variable 

A quantitative variable has numerical values that measure some characteristic of each observation, like height in 
centimeters or salary in dollars per year 

Such variables are best represented by a Histogram 


[Figure: Histogram of Income Distribution]

Things to Remember

Before drawing the histogram, the range of the
quantitative variable is divided into
classes of equal width
Exercise



Exercise 1. In the ‘Bar Graph’ of Gender Distribution (in Section 1.2.1), there are two categories. In the ‘Histogram’
of Income Distribution (in Section 1.2.2), there are five categories. Otherwise, how are they different from each
other?

[Hint: Does the sequence of categories matter?]
Scales


1.3 Scales of Measurement 



1.3.1 Properties of Measurement Scales 

Four Key Properties Relating to Scales of Measurement

Identity        : Each number has a particular meaning
Magnitude       : Numbers have an inherent order from smaller to larger
Equal Intervals : Differences between numbers (units) anywhere on the scale are the same
True Zero       : Zero point represents the absence of the characteristic being measured

Scale             Identity   Magnitude   Equal Intervals   True Zero
Nominal Scale     ?          ?           ?                 ?
Ordinal Scale     ?          ?           ?                 ?
Interval Scale    ?          ?           ?                 ?
Ratio Scale       ?          ?           ?                 ?
1.3.2. Nominal Scale


Example: ‘Marital Status’

Consider a variable ‘Marital Status’ with three categories: Single, Married and Divorced
Arbitrarily, a number is assigned to each of the three categories


Scenario A: Single denoted by ‘0’, Married
denoted by ‘1’ and Divorced denoted by ‘9’


Marital Status 



0 1 9 

(Single) (Married) (Divorced) 


Scenario B: Single denoted by ‘1’, Married 
denoted by ‘0’ and Divorced denoted by ‘2’ 

Marital Status 



0 1 2 
(Married) (Single) (Divorced) 


As Observed 

1. Scale points ‘0’, ‘1’ and ‘9’ have specific meanings with respect to a scenario

2. Scale points are mere names of the categories. The categories on a nominal scale cannot be rank-ordered

3. Difference between two consecutive scale points is not ‘equal’. In fact, calculation of such a difference is not even meaningful.

4. The scale point ‘0’ is not the ‘true’ zero point (Here ‘0’ does not mean ‘absence’ of marital status)
1.3.3. Ordinal Scale


Example: ‘Preference Rank’ for Car’s Color

Suppose a car buyer is asked to provide preference ranking for colors: white, black, red and blue

Color    Buyer’s State of Mind        Scenario A: Preference Rank    Scenario B: Preference Rank
                                      (on a scale of 1 to 4)         (on a scale of 0 to 3)
Black    I simply love black          1                              0
White    I strongly like white too    2                              1
Red      Red looks very sporty        3                              2
Blue     I hate blue                  4                              3


As Observed

1. Every scale point (preference rank) has a particular meaning with respect to a scenario

2. Each number on the scale is different from the other and all of them can be rank-ordered

3. Rank difference between the last two preferences cannot be assumed to be equivalent to rank difference between the first
two preferences. Clearly, the customer’s preference for black over white is not as strong as his preference for red over blue.

4. The scale point ‘0’ is not the ‘true’ zero point (In Scenario B, black color has ‘0’ rank, which does not mean the absence of
preference. Rather, it is the most preferred color.)
1.3.4 Interval Scale


Example: Temperature Measurement on Celsius Scale

The thermometer identifies how many units of mercury correspond to the temperature measured

[Figure: Thermometer with a Celsius scale running from -20 to 130]


As Observed 

1. Every scale point has a particular meaning 

2. Each number on the scale is different from the other and all of them can be rank-ordered (80°C is hotter than 60°C)

3. There is the same 10-degree difference in temperature between 20° and 30° as between 50° and 60° 

4. 0°C is not a ‘true’ zero point. It does not indicate absence of temperature; it is an arbitrary point on the scale 
1.3.5. Ratio Scale


Example: Purchasing Power

The purchasing power is measured in terms of money

[Figure: Purchasing Power scale marked at $0, $25, $50, $75, $100, $125 and $150]


As Observed 

1. Every scale point has a particular meaning 

2. Numbers on scale can be rank-ordered ($150 has more purchasing power than $125) 

3. There is the same $25 difference in purchasing power between $100 and $125 as between $50 and $75 

4. $0 is a ‘true’ zero point. $0 means no money and absolutely no ability to purchase anything 
Scales


1.3.6. Scales of Measurement: Summary



Four Key Properties Relating to Scales of Measurement

Identity        : Each number has a particular meaning
Magnitude       : Numbers have an inherent order from smaller to larger
Equal Intervals : Differences between numbers (units) anywhere on the scale are the same
True Zero       : Zero point represents the absence of the characteristic being measured

Scale             Identity   Magnitude   Equal Intervals   True Zero
Nominal Scale     Yes        No          No                No
Ordinal Scale     Yes        Yes         No                No
Interval Scale    Yes        Yes         Yes               No
Ratio Scale       Yes        Yes         Yes               Yes
Exercise


Exercise 2. List at least 3 examples each for

a. Nominal Scale Variable 

b. Ordinal Scale Variable 

c. Interval Scale Variable 

d. Ratio Scale Variable 
Chapter 2: Univariate Analysis
2.1 Univariate Analysis



2.1.1. Plotting and Visualizing a Distribution 

Recall the Histogram of Dollar Income plotted in Section 1.2.2. 

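[Figure: Histogram of Income Distribution; x-axis: Income (in $) with classes < 25K, 25K-50K, 50K-75K, 75K-100K and 100K+]


After Adding a Polynomial Trend Line

[Figure: The same Income Distribution histogram with a polynomial trend line; x-axis: Income (in $)]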


2.1.2. Examining a Distribution 

Look for the overall pattern and for striking deviations from that pattern 



■ Center (Measures of Central Tendency)
■ Spread (Measures of Dispersion)
■ Shape (Skewness / Kurtosis)

19 | June 30, 2015 | © 2015 ExlService Holdings, Inc.

2.1.3. Percentiles 

A percentile is the value of a variable below which a certain percent of observations fall. 


SAS Tip: PROC UNIVARIATE


The p-th percentile is a value such that, when the data is sorted,

At most p% of the observations are less than this value; and
At most (100 - p)% are greater

Examples:

P1 (1st percentile) cuts off the lowest 1% of sorted data
P98 (98th percentile) cuts off the lowest 98% of sorted data


Term Used      Number of Splits
Quartiles      4
Deciles        10
Pentiles       20
Percentiles    100

[Figure: Percentile scale from 0 to 100 with the following markers]

P100 or Maximum
P75 or Q3
P50 or Median or Q2
P25 or Q1
P0 or Minimum
2.2 Center (Measures of Central Tendency)



2.2.1 Mean

Mean is the arithmetic average of the observations.

SAS Tip: PROC MEANS

\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{X_1 + X_2 + \cdots + X_n}{n} \]

n : Number of observations
X_i : i-th observation


The Most Common Measure of Central Tendency 


Think about it...

If the frequency of any observation is more
than 1 (e.g. if two students scored the
same marks), what would the formula
for mean calculation look like?


Affected by Extreme Values (Outliers) 

‘Center of Gravity’ of a Distribution - the point on which the distribution would balance 


Example: Find average marks scored by a group of 4 students: A (1 mark), B (3 marks), C (6 marks), D (10 marks) 


[Figure: Dot plot of the four marks on a scale from 0 to 10, balancing at the mean]

\[ \bar{X} = \frac{1 + 3 + 6 + 10}{4} = \frac{20}{4} = 5 \]
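
The arithmetic in this example can be checked with a few lines of code. The sketch below uses Python rather than SAS, and the function name `arithmetic_mean` is only illustrative. It also shows, with a made-up extreme value of 100, how strongly a single outlier pulls the mean.

```python
def arithmetic_mean(values):
    """Return the sum of the observations divided by their count."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


marks = [1, 3, 6, 10]  # students A, B, C and D from the example
mean_marks = arithmetic_mean(marks)
print(mean_marks)  # 20 / 4 = 5.0

# Hypothetical variation: student D scores 100 instead of 10
mean_with_outlier = arithmetic_mean([1, 3, 6, 100])
print(mean_with_outlier)  # 110 / 4 = 27.5
```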
2.2.2. Median

Median is the midpoint of a distribution.

\[ M = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ observation} \]

A Robust Measure of Central Tendency
Not Affected by Extreme Values (Outliers)


If number of observations (n) is odd, the median (M) is
the center observation in the ordered list

[Figure: Dot plot of an odd number of observations on a scale from 0 to 10]

SAS Tip: PROC MEANS


Things to Remember

Before calculating median, always arrange all observations 
in order of size, from smallest to largest. However, if you 
are using PROC MEANS in SAS, there is no need to sort 
manually. SAS does it automatically. 


If number of observations (n) is even , the median (M) is the 
mean of the two center observations in the ordered list 

[Figure: Dot plot of an even number of observations on a scale from 0 to 10, with center observations 5 and 7]



Median = (5+7)/2 = 6 
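
The odd and even cases can be written as one small routine. This is a minimal Python sketch, not the SAS implementation; the sample values are hypothetical and were chosen so that the even case reproduces the (5+7)/2 = 6 calculation above.

```python
def median(values):
    """Median of a list: sort first, then take the center observation(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single center observation
        return ordered[mid]
    # Even n: mean of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [9, 1, 5, 3, 7]            # hypothetical, n = 5
even_sample = [1, 3, 5, 7, 9, 10]       # hypothetical, n = 6
with_outlier = [1, 3, 5, 7, 9, 1000]    # same data with an extreme value

median_odd = median(odd_sample)         # 5
median_even = median(even_sample)       # (5 + 7) / 2 = 6.0
median_outlier = median(with_outlier)   # still 6.0
```

The last call illustrates robustness: replacing 10 with 1000 leaves the median unchanged.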


2.2.3. Mode 

Mode is the most frequently occurring observation. 


There may be multiple modes 

[Figure: Dot plot on a scale from 0 to 14 in which more than one value occurs most frequently]




Not Affected by Extreme Values (Outliers) 

Can be used for Numerical as well as Categorical Data 
SAS Tip: PROC UNIVARIATE


There may not be a mode

[Figure: Dot plot with exactly one observation at each value from 0 to 6: No Mode]
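
The three situations described for the mode (a single mode, multiple modes, no mode) can be sketched as follows. This is an illustrative Python helper, not a SAS routine; the name `modes`, the sample data, and the convention of reporting "no mode" when every value occurs exactly once are assumptions made for this sketch.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    if not values:
        raise ValueError("mode of an empty data set is undefined")
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs exactly once: treat as "no mode"
        return []
    return sorted(v for v, c in counts.items() if c == highest)


multiple_modes = modes([1, 5, 5, 9, 13, 13])       # [5, 13]
no_mode = modes([0, 1, 2, 3, 4, 5, 6])             # []
categorical_mode = modes(["red", "blue", "red"])   # ['red']
```

The last call shows that the same logic works for categorical data, where a mean or median would not be defined.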
2.3 Spread (Measures of Dispersion)



2.3.1. Range

Range is the difference between the maximum and the minimum values


Range = MaxValue - MinValue

By definition, Range is affected by Extreme Values
Independent of data distribution

[Figure: Dot plot of observations on a scale from 0 to 10, with minimum 1 and maximum 10]

Range = 10 - 1 = 9


SAS Tip: PROC UNIVARIATE


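2.3.2. Inter-Quartile Range

Inter-Quartile Range is the difference between the third and the first quartile


Mid-spread (Spread in the middle 50%)
Independent of Extreme Values

\[ \mathrm{IQR} = Q_3 - Q_1 \]

\[ Q_i = \left(\frac{i \times (n+1)}{4}\right)^{\text{th}} \text{ observation} \]

[Figure: Sorted data split at P0, P25, P50, P75 and P100 into four parts of 25% each. P25 is the median of the 1st half, P50 is the overall median and P75 is the median of the 2nd half; the Inter-Quartile Range spans P25 to P75.]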
2.3.3. Variance

Variance is an average of the squares of the deviations of a set of observations from their mean.

\[ \sigma^2 = \frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n-1} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1} \]

Affected by Extreme Values (outliers) 

Variance = 0 if there is no spread (when all observations have the same value); Otherwise Variance > 0 


2.3.4. Standard Deviation 

Standard deviation is the square root of the variance. 


\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1}} \]

SAS Tip: PROC MEANS


Affected by Extreme Values (outliers) 

Std. Deviation = 0 if there is no spread (when all observations have the same value); Otherwise Std. Dev. > 0 
Std. Deviation has the same units of measurement as the original observations 
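
To make the two formulas concrete, the sketch below computes the variance and standard deviation with the (n - 1) divisor for the four marks 1, 3, 6 and 10 used in the mean example. It is written in Python for illustration; the function name `sample_variance` is an assumption, not a SAS keyword.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


marks = [1, 3, 6, 10]              # mean = 5
variance = sample_variance(marks)  # (16 + 4 + 1 + 25) / 3 = 46 / 3
std_dev = math.sqrt(variance)      # same units as the marks

# No spread: all 50 observations are identical
variance_no_spread = sample_variance([70] * 50)
```

Here `variance` is about 15.33 and `std_dev` about 3.92 marks, while `variance_no_spread` is exactly 0.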
2.3.5. Coefficient of Variation

Coefficient of variation (CV) is a normalized measure of dispersion of a probability distribution.

\[ CV = \frac{\sigma}{\bar{X}} \times 100\% \]

CV is unit-less and hence can be used for comparing variation in data sets with different units 

When the mean value is close to zero, CV will approach infinity and is hence sensitive to small changes in the 
mean 

CV is sensitive to outliers 
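
The unit-less property can be demonstrated by expressing the same measurements in two different units. The following Python sketch uses hypothetical heights; `coefficient_of_variation` is an illustrative helper built on the standard `statistics` module (whose `stdev` uses the n - 1 divisor).

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


heights_cm = [150, 160, 170, 180, 190]     # hypothetical data in centimeters
heights_m = [h / 100 for h in heights_cm]  # the same data in meters

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
```

Both calls give a CV of roughly 9.3%, even though the standard deviations themselves differ by a factor of 100.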


2.4 Shape (Skewness / Kurtosis)



2.4.1 Skewness

Skewness is the degree of departure from symmetry of a distribution.

■ Positively Skewed Distribution : Tail pulled in the positive direction
■ Negatively Skewed Distribution : Tail pulled in the negative direction

SAS Tip: PROC UNIVARIATE


2.4.2. Kurtosis


Kurtosis is the degree of peakedness of a distribution.

■ Mesokurtic Distribution : Normal distribution
■ Leptokurtic Distribution : Higher peak and heavier tails
■ Platykurtic Distribution : Lower peak and lighter tails


Mesokurtic Curve : Kurtosis = 3 

Leptokurtic Curve : Kurtosis > 3 

Platykurtic Curve : Kurtosis < 3 
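
The sign of skewness and the comparison of kurtosis against 3 can be explored with the plain moment-based definitions. The Python sketch below is an illustration under that assumption (population moments, no small-sample adjustment). Note that SAS PROC UNIVARIATE by default reports sample-adjusted values, with kurtosis measured as excess over the normal distribution, so its output is compared against 0 rather than 3 and will not match these numbers exactly.

```python
def central_moment(values, order):
    """Average of (x - mean) raised to the given power."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n


def skewness(values):
    """Third central moment scaled by the variance to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth central moment scaled by the squared variance (normal = 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]      # hypothetical data
right_tailed = [1, 1, 1, 10]     # long tail in the positive direction
left_tailed = [1, 10, 10, 10]    # long tail in the negative direction

skew_symmetric = skewness(symmetric)   # 0 for symmetric data
skew_right = skewness(right_tailed)    # positive
skew_left = skewness(left_tailed)      # negative
kurt_flat = kurtosis(symmetric)        # 1.7, i.e. below 3 (platykurtic)
```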
Example



Example: For 16 students, marks in Mathematics are given below:

40, 31, 19, 22, 47, 25, 24, 32, 43, 12, 20, 5, 6, 21, 43, 67

Let us calculate measures of central tendency and dispersion, without using SAS.

Arrange marks in ascending order:

5, 6, 12, 19, 20, 21, 22, 24, 25, 31, 32, 40, 43, 43, 47, 67

Min = 5
Q1 = 19.5 (median of the first half: 5, 6, 12, 19, 20, 21, 22, 24)
Median = 24.5
Q3 = 41.5 (median of the second half: 25, 31, 32, 40, 43, 43, 47, 67)
Max = 67
Mode = 43

Inter-Quartile Range = Q3 - Q1 = 41.5 - 19.5 = 22

Range = Max - Min = 67 - 5 = 62
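
The hand calculation can be cross-checked with a short script. The sketch below is Python (not SAS) and follows the "median of each half" approach to quartiles used in this example; other quartile definitions exist and may give slightly different values on other data sets.

```python
import statistics

marks = [40, 31, 19, 22, 47, 25, 24, 32, 43, 12, 20, 5, 6, 21, 43, 67]

ordered = sorted(marks)
half = len(ordered) // 2
lower_half = ordered[:half]    # 5, 6, 12, 19, 20, 21, 22, 24
upper_half = ordered[-half:]   # 25, 31, 32, 40, 43, 43, 47, 67

summary = {
    "min": ordered[0],
    "q1": statistics.median(lower_half),
    "median": statistics.median(ordered),
    "q3": statistics.median(upper_half),
    "max": ordered[-1],
    "mode": statistics.mode(ordered),
}
summary["range"] = summary["max"] - summary["min"]
summary["iqr"] = summary["q3"] - summary["q1"]

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```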
Marks (X)    X - Mean    (X - Mean)²
5            -23.6       555.2
6            -22.6       509.1
12           -16.6       274.3
19           -9.6        91.4
20           -8.6        73.3
21           -7.6        57.2
22           -6.6        43.1
24           -4.6        20.8
25           -3.6        12.7
31           2.4         5.9
32           3.4         11.8
40           11.4        130.8
43           14.4        208.4
43           14.4        208.4
47           18.4        339.9
67           38.4        1477.4

ΣX = 457     Σ(X - Mean) = 0.0     Σ(X - Mean)² = 4019.9

\[ \bar{X} = \frac{457}{16} = 28.56 \]

\[ \sigma = \sqrt{\frac{4019.9}{16 - 1}} = 16.37 \]

\[ CV = \frac{\sigma}{\bar{X}} \times 100\% = \frac{16.37}{28.56} \times 100\% = 57.3\% \]
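
The deviation table can be rebuilt programmatically as a check on the totals. This is a Python sketch of the same arithmetic; variable names are illustrative.

```python
import statistics

marks = [5, 6, 12, 19, 20, 21, 22, 24, 25, 31, 32, 40, 43, 43, 47, 67]

mean = statistics.mean(marks)                    # 457 / 16
deviations = [x - mean for x in marks]           # the "X - Mean" column
squared = [d ** 2 for d in deviations]           # the "(X - Mean)^2" column

sum_of_squares = sum(squared)
variance = sum_of_squares / (len(marks) - 1)     # divisor n - 1 = 15
std_dev = variance ** 0.5
cv_percent = std_dev / mean * 100

print(f"mean = {mean:.2f}, sum of squares = {sum_of_squares:.1f}")
print(f"std dev = {std_dev:.2f}, CV = {cv_percent:.1f}%")
```

The deviations sum to zero by construction, which is a quick sanity check on any hand-built table of this kind.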

Example 


DART 


For the same example, let us compute measures of center, spread and shape using SAS. 


Convert Raw Data to SAS Format 

data outlib.marks_data;
    infile datalines;
    input marks;
    datalines;
40
31
19
22
47
25
24
32
43
12
20
5
6
21
43
67
;
run;


Run PROC UNIVARIATE on SAS Data Set


proc univariate data = outlib.marks_data modes;
    var marks;
    /* Output Data Set: one row holding the requested statistics */
    output out = outlib.marks_univariate
        n        = n
        min      = min
        mean     = mean
        median   = median
        mode     = mode
        max      = max
        range    = range
        q1       = q1
        q3       = q3
        qrange   = qrange
        std      = std_dev
        skewness = skewness
        kurtosis = kurtosis;
run;


Output (marks_univariate.sas7bdat)

n    mean      std_dev    skewness   kurtosis   max   q3     median   q1     min   range   qrange   mode
16   28.5625   16.37058   0.68004    0.542626   67    41.5   24.5     19.5   5     62      22       43
2.5 Box Plot



A box-plot is a graph of the summary of a variable’s distribution


A central box spans the quartiles Q1 and Q3
A line in the box marks the median M
A symbol (+) marks the mean

Lines extend from the box out to the min. and max. values

Let us create the box plot graph for the previous example. 


univariate.sas 


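data outlib.marks_data;
    set outlib.marks_data;
    format Boxplot $1.;
    Boxplot = "";   /* dummy BY variable with a single (blank) value */
run;


/* Keep only the side-by-side schematic (box) plots in the output */
ods select ssplots;

/* PLOT requests the plots; combined with BY it produces the schematic plots */
proc univariate data = outlib.marks_data plot;
    var marks;
    by Boxplot;
run;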



Some More Information

The above snippet works well in Base SAS.

For using PROC BOXPLOT, you will need SAS/GRAPH

[Output listing (univariate.lst): The UNIVARIATE Procedure, Variable: marks, Schematic Plots. A text box plot of marks drawn on a vertical axis from 0 to 70, with the mean marked by + and the top whisker labeled Maximum = 67.]
Exercise



Exercise 3. For 15 randomly chosen employees working in Delhi NCR, the monthly salary (in thousand rupees) is
given below:

4, 25, 30, 30, 30, 31, 32, 35, 50, 50, 50, 55, 60, 74, 110

a. In MS Excel spreadsheet (without using inbuilt functions), calculate 

i. Mean 

ii. Median 

iii. Mode 

iv. Range 

v. Inter-Quartile Range 

vi. Variance 

vii. Standard Deviation 

viii. Coefficient of Variation 

b. In SAS (using SAS procedures), compute and report all of the above metrics along with Skewness and 
Kurtosis 

c. Using SAS, create box-plot graph 

Exercise 4. In a class of 50 students, all scored 70 marks in Statistics. Calculate Mean, Median, Mode, Range, 
Inter-Quartile Range, Variance, Standard Deviation and Coefficient of Variation. 

[Hint: Do you really need a calculator or any software to answer this?] 
Chapter 3: Sampling Distributions
3.1 Sampling



3.1.1. Sample Selection

True Population: The entire collection of data, which we wish to analyze, describe or draw conclusions about

Sample Population: The part of the true population that we actually examine in order to gather information and draw valid conclusions about the larger group


Two Most Frequently Used Methods of Sample Selection 

Simple Random Sampling: 

This method selects units with equal probability and without replacement 

Each possible sample of n different units out of N has the same probability of being selected 

Selection probability for each unit = n / N 


SAS Tip: PROC SURVEYSELECT


Stratified Random Sampling:

This method selects random samples independently within strata
Selection probability for a unit in stratum h = n_h / N_h

Other Methods include Sequential Random Sampling, Systematic Random Sampling etc. 
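
The two main selection methods can be sketched in a few lines. The example below is Python rather than PROC SURVEYSELECT; the population of 100 numbered units, the stratum names, the per-stratum sample sizes and the helper names are all hypothetical.

```python
import random


def simple_random_sample(units, n, seed=None):
    """Draw n distinct units with equal probability, without replacement."""
    rng = random.Random(seed)
    return rng.sample(units, n)


def stratified_random_sample(strata, sizes, seed=None):
    """Draw an independent simple random sample within each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, sizes[name]) for name, units in strata.items()}


population = list(range(1, 101))    # hypothetical population, N = 100
srs = simple_random_sample(population, 10, seed=42)   # n / N = 10 / 100

strata = {
    "north": list(range(1, 61)),    # N_h = 60
    "south": list(range(61, 101)),  # N_h = 40
}
stratified = stratified_random_sample(strata, {"north": 6, "south": 4}, seed=42)
```

With 6 of 60 and 4 of 40 units selected, every unit in either stratum has selection probability n_h / N_h = 0.1 in this particular allocation.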
3.1.2. Parameter and Statistic

A parameter is a value, usually unknown, that has to be estimated with
minimum error using a statistic from one or more samples.


Parameter

■ Describes a characteristic of a population
■ Fixed value within the population
■ Generally represented by Greek letters (e.g. σ)

Statistic

■ Describes a characteristic of a sample
■ Value varies from sample to sample within the same population
■ Generally represented by English letters (e.g. S)


Example: 

A properly chosen sample of 1600 people across the country was asked if they regularly watch a certain television 
program, and 24% said yes. 

Parameter : The true proportion of all people in the country who watch the program 
Statistic : 24% (as obtained from the sample of 1600 people) 
3.1.3. Sampling Distribution of a Statistic

The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of 
the same size (n) from the same population. 

Sampling distribution is nothing but the probability distribution of the statistic - It tells what values a statistic 
takes and how often it takes those values in repeated sampling. 

Example: A box contains 10 counters. The color of the counter can be

■ Yellow (Score Value = 1)
■ Green (Score Value = 2)
■ Red (Score Value = 5)
■ Blue (Score Value = 4)

Suppose there are 2 yellow, 2 green, 3 red and 3 blue counters in the box. A person is asked to draw 2 counters from this box.

All Possibilities of Drawing 2 Counters: Frequency Distribution

First Counter    Second Counter    Frequency
Yellow           Yellow            1
Yellow           Green             4
Yellow           Red               6
Yellow           Blue              6
                                   1 + 4 + 6 + 6 = 17
Green            Green             1
Green            Red               6
Green            Blue              6
                                   1 + 6 + 6 = 13
Red              Red               3
Red              Blue              9
                                   3 + 9 = 12
Blue             Blue              3
                                   3

Total number of possible samples: 17 + 13 + 12 + 3 = 45


Determine the sampling distribution of the mean score of this person. 
Frequency Distribution and Mean of Scores Obtained in Each Sample

Sample (Pair)      Frequency    Mean Score
Yellow, Yellow     1            1.0
Yellow, Green      4            1.5
Yellow, Red        6            3.0
Yellow, Blue       6            2.5
Green, Green       1            2.0
Green, Red         6            3.5
Green, Blue        6            3.0
Red, Red           3            5.0
Red, Blue          9            4.5
Blue, Blue         3            4.0
Total              17 + 13 + 12 + 3 = 45


If a random sample of size ‘n’ is drawn from a population
with mean μ and standard deviation σ, then the sampling
distribution of the means has mean μ and standard deviation σ/√n


Sampling Distribution of Mean Score

Mean    Frequency    Probability
1       1            1/45
1.5     4            4/45
2       1            1/45
2.5     6            6/45
3       12           12/45
3.5     6            6/45
4       3            3/45
4.5     9            9/45
5       3            3/45
Sum     45           1
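
The table above can be reproduced by enumerating every possible pair of counters. The following Python sketch does this with exact fractions; it treats the 10 counters as distinguishable objects and draws 2 without replacement, which is the counting used in the example.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

scores = {"yellow": 1, "green": 2, "red": 5, "blue": 4}
counters = ["yellow"] * 2 + ["green"] * 2 + ["red"] * 3 + ["blue"] * 3

# Every possible sample of 2 distinct counters out of 10
samples = list(combinations(range(len(counters)), 2))

# Frequency of each mean score across all samples
frequency = Counter(
    Fraction(scores[counters[i]] + scores[counters[j]], 2) for i, j in samples
)
total = sum(frequency.values())

# Sampling distribution: mean score -> probability
distribution = {
    mean: Fraction(count, total) for mean, count in sorted(frequency.items())
}

expected_mean = sum(mean * prob for mean, prob in distribution.items())
population_mean = Fraction(sum(scores[c] for c in counters), len(counters))

for mean, prob in distribution.items():
    print(f"{float(mean):>4}: {frequency[mean]:>2} / {total}  ({prob})")
```

The mean of the sampling distribution equals the population mean (3.3 in both cases). Because the counters are drawn without replacement from a small population, the spread of the sample means is somewhat smaller than σ/√n would suggest.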


3.1.4. Standard Error 

Standard error is the standard deviation of the sampling distribution of the statistic. If the statistic is the sample mean,
and the samples are uncorrelated, the standard error is given by:

\[ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \]
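
As a quick numerical illustration of this formula, the Python sketch below borrows the figures from the grocery store example in Chapter 4 (population standard deviation of $30.2 and a sample of 40 receipts); the second call, with a sample of 160, is a hypothetical addition.

```python
import math


def standard_error_of_mean(sigma, n):
    """Standard deviation of the sample mean for n uncorrelated observations."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return sigma / math.sqrt(n)


se_40 = standard_error_of_mean(30.2, 40)     # about 4.78
se_160 = standard_error_of_mean(30.2, 160)   # four times the sample size

print(f"n = 40:  standard error = {se_40:.3f}")
print(f"n = 160: standard error = {se_160:.3f}")
```

Quadrupling the sample size halves the standard error, since n enters through its square root.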
3.2 Random Variable & Probability Distributions



3.2.1 Random Variable

A random variable is a variable whose value is a numerical outcome of a random phenomenon. It may be discrete
or continuous.


1. Discrete Random Variables: Random variables that have a countable list of possible outcomes, with probabilities
assigned to each of these outcomes. Examples:

■ Number of Cars Owned (0, 1, 2, 3, ...)
■ Attendance (in terms of number of days) in a month (1, 2, 3, ... , 31)


2. Continuous Random Variables: Random variables that can take on any value in an interval, with probabilities given as areas
under a density curve. Examples:

■ Weight
■ Temperature


Type of Random Variable    Probability Distribution Function
Discrete                   Probability Mass Function (p.m.f.)
Continuous                 Probability Density Function (p.d.f.)


The probability distribution of a random variable is the mathematical function describing the possible values of
a random variable and their associated probabilities
3.2.2. Probability Mass Function (p.m.f.)

A probability mass function (p.m.f.) is a function that gives the probability that a discrete random variable X is
exactly equal to some value x.

Probability Mass Function : f(x) = P(X = x) such that the following two conditions are satisfied:

\[ 0 \le f(x) \le 1 \quad \text{for every } x \in S \]

\[ \sum_{x \in S} f(x) = 1 \]

where S is the Sample Space


Coin Toss Example: Toss three unbiased coins and observe ‘Number of Heads’.
Random Variable: Number of Heads Observed (which may take four values: 0, 1, 2 or 3)
Sample Space = {TTT, HTT, THT, TTH, HHT, HTH, THH, HHH}


Probability Mass Function

x      f(x)
0      f(0) = P(X = 0) = P(TTT) = 1/8
1      f(1) = P(X = 1) = P(HTT, THT, TTH) = 3/8
2      f(2) = P(X = 2) = P(HHT, HTH, THH) = 3/8
3      f(3) = P(X = 3) = P(HHH) = 1/8
Sum    Σ f(x) = 1/8 + 3/8 + 3/8 + 1/8 = 1
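
The same probability mass function can be obtained by listing the sample space and counting heads. The sketch below is an illustrative Python version of that counting argument, using exact fractions so that the two p.m.f. conditions can be checked directly.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for three unbiased coins: 8 equally likely outcomes
outcomes = ["".join(toss) for toss in product("HT", repeat=3)]

# Random variable X = number of heads in each outcome
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# f(x) = P(X = x) = (number of outcomes with x heads) / 8
pmf = {x: Fraction(count, len(outcomes)) for x, count in sorted(head_counts.items())}

for x, probability in pmf.items():
    print(f"f({x}) = {probability}")
print("Sum =", sum(pmf.values()))
```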
[Figure: Probability Mass Function]
3.2.3. Probability Density Function (p.d.f.)

A probability density function is a function that describes the relative likelihood for a continuous random variable
X to take values within a particular interval [a, b].

Probability Density Function:

\[ P(a \le X \le b) = \int_{a}^{b} f(x)\,dx \]

such that the following two conditions are satisfied:

1. f(x) is a non-negative integrable function

2. The total area under f(x) is 1:

\[ \int_{-\infty}^{\infty} f(x)\,dx = 1 \]


In case of a continuous random variable

There are an infinite number of outcomes and hence a probability cannot be assigned to each individual outcome
Probabilities are assigned to intervals of outcomes by using areas under density curves
A density curve has area exactly 1 underneath it, corresponding to total probability 1

[Figure: Probability Density Function; the shaded area under the curve between a and b equals P(a < X < b)]
3.2.4 Cumulative Distribution Function (c.d.f.)

A cumulative distribution function describes the probability that a random variable X with a given probability
distribution will be found at a value less than or equal to x. Intuitively, it is the "area so far" function of the probability
distribution.


Cumulative Distribution Function:

\[ F(x) = P(X \le x) \]

For Discrete Random Variable:

\[ F(x_i) = P(X \le x_i) = \sum_{k=1}^{i} P(X = x_k) \]

For Continuous Random Variable:

\[ F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt \]

Properties of CDF:

\[ 0 \le F(x) \le 1, \qquad \lim_{x \to -\infty} F(x) = 0, \qquad \lim_{x \to +\infty} F(x) = 1 \]

[Note: P(X = x_k) = F(x_k) - F(x_{k-1})]
[Note: f is the probability density function]

[Figure: c.d.f. for Discrete Random Variable (step function rising from 0 to 1)]

[Figure: c.d.f. for Continuous Random Variable (smooth curve rising from 0 to 1)]


Coin Toss Example Continued...

x      f(x)          F(x)
0      f(0) = 1/8    F(0) = 1/8
1      f(1) = 3/8    F(1) = 1/8 + 3/8 = 4/8
2      f(2) = 3/8    F(2) = 4/8 + 3/8 = 7/8
3      f(3) = 1/8    F(3) = 7/8 + 1/8 = 1
Sum    Σ f(x) = 1
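
For a discrete variable the c.d.f. is simply a running total of the p.m.f. The following Python sketch builds F(x) for the coin toss example and then recovers f(x) from successive differences, as stated in the note on P(X = x_k).

```python
from fractions import Fraction
from itertools import accumulate

# p.m.f. of the number of heads in three tosses of an unbiased coin
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

values = sorted(pmf)

# F(x_i) = f(x_1) + ... + f(x_i): cumulative sum over the sorted values
cdf = dict(zip(values, accumulate(pmf[x] for x in values)))

# Recover the p.m.f. from the c.d.f.: f(x_k) = F(x_k) - F(x_{k-1})
recovered = {values[0]: cdf[values[0]]}
for previous, current in zip(values, values[1:]):
    recovered[current] = cdf[current] - cdf[previous]

for x in values:
    print(f"F({x}) = {cdf[x]}")
```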


[Figure: C.D.F. for a Discrete Variable]
3.3 List of Distributions


3.3.1. Examples of Discrete Distributions

1. Discrete Uniform Distribution
2. Binomial Distribution
3. Hypergeometric Distribution
4. Poisson Distribution
5. Geometric Distribution
6. Negative Binomial Distribution

Note: Details of each distribution are given in Appendix A.2

3.3.2. Examples of Continuous Distributions

1. Continuous Uniform Distribution
2. Normal Distribution
3. Chi-Square Distribution
4. F Distribution
5. Student’s t Distribution
6. Exponential Distribution
7. Gamma Distribution
8. Beta Distribution

Note: Details of each distribution are given in Appendix A.3
Exercise



Exercise 5. In an Excel spreadsheet, plot

1. Probability Mass Function of Binomial Distribution

2. Probability Mass Function of Hypergeometric Distribution

3. Probability Mass Function of Poisson Distribution

4. Probability Mass Function of Negative Binomial Distribution

5. Probability Density Function of Normal Distribution

6. Probability Density Function of Exponential Distribution

7. Probability Density Function of Gamma Distribution

[Hint A: Explore Excel functions: BINOMDIST, HYPGEOMDIST, POISSON, NEGBINOMDIST, NORMDIST,
EXPONDIST and GAMMADIST]

[Hint B: Refer to Appendix A.2 and A.3]
Chapter 4: Hypothesis Testing
4.1 Key Concepts



4.1.1 Statistical Hypothesis 

A statistical hypothesis is some assumption or statement about a population parameter. 

This assumption may or may not be true 
It is tested on the basis of the evidence from a random sample 

Example: Average sales amount (i.e. sales amount per transaction) of a grocery store is $150. 

Statistical hypotheses are of two types: 

a. Null Hypothesis

Hypothesis which is being tested. It is denoted by H0.

b. Alternative Hypothesis

Counter-proposition to the null hypothesis. It is denoted by H1 or HA.


Example: Let the average sales amount of a grocery store be denoted by μ. There could be three scenarios.

Left-Tailed (One-Tailed)     : H0: μ ≥ 150 and H1: μ < 150
Right-Tailed (One-Tailed)    : H0: μ ≤ 150 and H1: μ > 150
Two-Tailed                   : H0: μ = 150 and H1: μ ≠ 150
Can We Accept the Null Hypothesis?

Based on a sample, we cannot conclude that H0 is true. So, instead of concluding “we accept H0”, we say “we fail to reject H0”.
Note: “Acceptance of H0” implies H0 is true, while “Failure to reject H0” implies evidence from the sample is not sufficient for rejecting H0.


4.1.2. Tests of Significance 

Test of significance is, in general, a procedure to assess the significance of difference between two or more values. 

List of Some Commonly Used Tests: See Appendix A.1 

Example: The sales amount of a grocery store is normally distributed with standard deviation of $30.2. A sample of 
40 sales receipts has an average sales amount of $137. Based on sample information, one may want to test 
whether the mean of sales at the grocery store is different from $150. 

In this case, 

Assumed population mean (μ0) = 150 and Sample mean (x̄) = 137
Objective is to test the significance of difference between μ0 and x̄
Population is normally distributed and its variance is known
Therefore, Z-Test should be applied


■ If the difference is not found to be significant, we do not reject H0 and it is attributed to pure chance due to
fluctuations in sampling

■ If the difference is found to be significant, we reject H0 and it is attributed to some non-random cause
4.1.3. Test Statistic

A test statistic is a value calculated from a sample to test the validity of null hypothesis. Since a test statistic is a 
random variable, it has a probability distribution. 


Example:

Null Hypothesis : Average sales amount of grocery store = $150 [i.e. H0: μ = 150]

Alternative Hypothesis : Average sales amount of grocery store ≠ $150 [i.e. H1: μ ≠ 150]

Additional Information :

1. Sales amount is normally distributed with standard deviation of $30.2 [i.e. X ~ N(μ, 30.2²)]

2. From a sample of 40 transactions, average sales amount = $137 [i.e. x̄ = 137]

Test Statistic :

\[ z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{137 - 150}{30.2 / \sqrt{40}} = -2.722 \]

Properties 

All good test statistics should have two properties: 

(a) They should tend to behave differently when H0 is true from when H1 is true; and

(b) Their probability distribution should be calculable under the assumption that H0 is true. It is also desirable that
tables of this probability distribution exist
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4,1.4 Types of Error, Level of Significance and Power of Test 

There can be two types of errors in Hypothesis Testing 

Type I Error : Error of rejecting H 0 when it is, in fact, true 
Type II Error : Error of not rejecting H 0 when it is, in fact, false 


Size of the Test (or Level of Significance) = 

■ Probability of committing a Type I Error 

■ Maximize size of Type I Error at risk 


o H 0 is not rejected 
'<75 
o 

q H 0 is rejected 


Situation 


H 0 is true 

Right Decision 

Type I Error 
P(Type I Error) = 


H 0 is false 

Type II Error 
P(Type I Error) = 

Right Decision 


Power of the Test = 





Probability of not 
committing a Type II 
Error 

Ability of the Test to 
Reject a False Null 
Hypothesis 



For a given sample size, it is not possible to minimize both types of errors simultaneously
In practice, a Type I error is usually treated as more serious than a Type II error
Classical Approach:

1. Keep the probability of committing a Type I error (α) at a fairly low level (0.01, 0.05 or 0.10)

2. Then try to minimize the probability of having a Type II error (β), i.e. maximize the power of the test

Set α to 1%, 5% or 10% and then try to minimize β

4.1.5. Confidence Interval

A confidence interval is the interval in which we can be 100(1 - α)% sure that the value of a particular population parameter will lie.
Note: α is the level of significance.


Significance Level (α) | Confidence Level
0.01                   | 99%
0.05                   | 95%
0.10                   | 90%


95% Confidence Interval for Mean (μ) of a Large Sample:

μ : Population Mean
σ : Population Standard Deviation
n : Sample Size
\(\bar{X}\) : Sample Mean

\[ X_i \sim N(\mu, \sigma^2) \;\Rightarrow\; \bar{X} \sim N(\mu, \sigma^2/n) \;\Rightarrow\; Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \]

From Normal Distribution Table,

\[ P(-1.96 < Z < 1.96) = 0.95 \]

\[ P\left(-1.96 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95 \]

\[ P\left(\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95 \]


Interpretation of 95% Confidence Interval:

If the procedure of calculating a 95% confidence interval is
repeated on multiple samples, about 95% of the resulting
intervals will contain the population parameter. Any single
interval either contains the parameter or it does not.
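The repeated-sampling interpretation can be checked with a small simulation. This is a sketch under stated assumptions: the true mean (150), standard deviation (30.2) and sample size (40) are borrowed from the grocery store example purely for illustration, and the number of repetitions and the random seed are arbitrary choices.

```python
import random
from math import sqrt


def coverage(mu=150.0, sigma=30.2, n=40, repeats=2000, seed=0):
    """Share of 95% confidence intervals that contain the true mean mu."""
    rng = random.Random(seed)                 # fixed seed so the run is repeatable
    half_width = 1.96 * sigma / sqrt(n)       # known-sigma interval half-width
    hits = 0
    for _ in range(repeats):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if sample_mean - half_width < mu < sample_mean + half_width:
            hits += 1
    return hits / repeats


print(f"share of intervals containing mu: {coverage():.3f}")
```

The printed share should be close to 0.95, although it varies slightly with the seed and the number of repetitions.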


4.1.6. Critical Regions 

The confidence interval is called the acceptance region, and the areas outside it are called the critical regions or regions
of rejection of the null hypothesis. The lower and upper limits of the acceptance region are called the critical values.


Example: Continuing with the grocery store example:

\(H_0: \mu = 150\), \(H_1: \mu \neq 150\), \(X \sim N(\mu, 30.2^2)\), \(\bar{x} = 137\), \(n = 40\)

\[ P\left(\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95 \]

\[ P\left(137 - 1.96 \times \frac{30.2}{\sqrt{40}} < \mu < 137 + 1.96 \times \frac{30.2}{\sqrt{40}}\right) = 0.95 \]

\[ P(137 - 9.4 < \mu < 137 + 9.4) = 0.95 \]

\[ P(127.6 < \mu < 146.4) = 0.95 \]


95% Confidence Interval for μ : [127.6, 146.4]

Observation : μ = 150 lies in the critical region
Conclusion : H0 is rejected at 5% significance level
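The interval in this example can be reproduced with a few lines of Python. This is a minimal sketch that assumes a known population standard deviation; the function name `mean_confidence_interval` is hypothetical, and the critical value is taken from the standard library's `statistics.NormalDist` rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    half_width = z * sigma / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


lower, upper = mean_confidence_interval(sample_mean=137, sigma=30.2, n=40)
mu0 = 150
reject_h0 = not (lower < mu0 < upper)    # mu0 outside the acceptance region
print(f"95% CI: [{lower:.1f}, {upper:.1f}]; reject H0: {reject_h0}")
```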



49 | June 30, 2015 | © 2015 ExIService Holdings, Inc. 










4.1.7. The p-Value or Exact Level of Significance

Instead of preselecting α (the level of significance) at arbitrary levels, such as 1%, 5% or 10%, one can obtain the p
(probability) value, or exact level of significance, of a test statistic. The p-value is defined as the lowest significance
level at which a null hypothesis can be rejected.


Example:

Null Hypothesis : Average sales amount of grocery store = $150 [i.e. \(H_0: \mu = 150\)]

Alternative Hypothesis : Average sales amount of grocery store ≠ $150 [i.e. \(H_1: \mu \neq 150\)]

Additional Information :

1. Sales amount is normally distributed with standard deviation of $30.2 [i.e. \(X \sim N(\mu, 30.2^2)\)]

2. From a sample of 40 transactions, average sales amount = $137 [i.e. \(\bar{x} = 137\)]

Test Statistic :

\[ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{137 - 150}{30.2/\sqrt{40}} = -2.722 \]

It is a two-tailed test. From the table, p-value = 0.0065
[ P(Z < -2.722) = 0.325%, P(Z > 2.722) = 0.325%: adding up to 0.65% ]

Interpretation:

p-value = 0.65% is the exact level of significance.

■ Since 0.65% < 1%, we reject H0 not only at the 5% level, but
even at the 1% level of significance

■ That is, we reject H0 not with just 95% confidence, but with
99% confidence.
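The table lookup above can be cross-checked numerically. The sketch below assumes a two-tailed z-test with known standard deviation and uses `statistics.NormalDist` from the Python standard library in place of the printed normal table; the function name `two_sided_p_value` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(sample_mean, mu0, sigma, n):
    """z statistic and two-tailed p-value for H0: mu = mu0 with known sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p = 2 * NormalDist().cdf(-abs(z))    # area in both tails beyond |z|
    return z, p


z, p = two_sided_p_value(sample_mean=137, mu0=150, sigma=30.2, n=40)
print(f"z = {z:.3f}, p-value = {p:.4f}")
```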

4.2 Step-by-Step Process



Steps in Hypothesis Testing

Process of hypothesis testing can be summarized in the following five steps:

Step 1: Set up Null Hypothesis (H0) and Alternative Hypothesis (H1), keeping in mind if it is going to be a single
(left or right) tailed or a two-tailed test

Step 2: Calculate the test statistic and determine its probability distribution
Step 3: Choose the appropriate (1%, 5% or 10%) level of significance (α)

Step 4: Evaluate the test statistic by either the confidence interval approach or the p-value approach
Step 5: Conclude by either rejecting or not rejecting the null hypothesis at the given level of significance
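The five steps can be sketched as a single function. This is an illustration rather than a prescribed implementation: it assumes a z-test with known population standard deviation, evaluates Step 4 with the p-value approach, and uses a hypothetical function name (`z_test`) and `tail` argument.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alpha=0.05, tail="two"):
    """Five-step z-test; returns (z, p_value, reject_h0).

    Step 1 is expressed through mu0 (H0: mu = mu0) and tail (form of H1).
    Step 3 is expressed through alpha.
    """
    std_normal = NormalDist()
    # Step 2: the test statistic, which follows N(0, 1) when H0 is true.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Step 4: evaluate the statistic with the p-value approach.
    if tail == "two":
        p_value = 2 * std_normal.cdf(-abs(z))
    elif tail == "right":
        p_value = 1 - std_normal.cdf(z)
    elif tail == "left":
        p_value = std_normal.cdf(z)
    else:
        raise ValueError("tail must be 'two', 'left' or 'right'")
    # Step 5: reject H0 when the p-value is below the chosen significance level.
    return z, p_value, p_value < alpha


z, p_value, reject = z_test(sample_mean=137, mu0=150, sigma=30.2, n=40, alpha=0.05, tail="two")
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

For a one-tailed test, pass `tail="right"` or `tail="left"` to match the direction of the alternative hypothesis.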
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Exercise 



Exercise 6. An insurance company is reviewing its current policy rates. When originally setting the rates, they
believed that the average claim amount was $1,800. They are concerned that the true mean is actually higher than
this, because they could potentially lose a lot of money. They randomly select 40 claims and calculate a sample
mean of $1,950. Assuming that the standard deviation of claims is $500, set α = 0.05 and test to see if the
insurance company should be concerned.

[Hint: At 5% significance level, the critical value for a one-tailed test from the table of z-scores = 1.645]

Chapter 5: Scatter Plots and Correlations

5.1 Scatter Plot



5.1.1 Explanatory and Response Variables 

A response variable measures an outcome of a study. An explanatory variable explains or influences changes in a 
response variable. 

A response variable is also called a dependent variable and is generally denoted by Y
An explanatory variable is also called an independent variable and is generally denoted by X

Analysis of relationship between two variables is referred to as Bivariate Analysis. To run this analysis, both 
variables are measured on the same individuals. 

Relationships between two quantitative variables are best displayed graphically through a scatterplot 


5.1.2. Plotting and Visualizing a Scatter Plot 

A scatterplot shows the relationship between two quantitative variables measured on the same individuals. 

The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis 
Each individual in the data appears as the point in the plot fixed by the values of both variables for that individual 


[Figure: example scatterplot, with X on the horizontal axis and Y on the vertical axis]

■ Always plot the explanatory variable on the horizontal axis (the x axis) of a scatterplot
■ If there is no explanatory-response distinction, either variable can go on the horizontal axis

5.1.3. Examining a Scatterplot

Look for the overall pattern and for striking deviations from that pattern. Describe the overall pattern by its
Form, Direction and Strength:

Form: Linear relationships, where the points show a straight-line pattern, are an important form of relationship 
between two variables. Curved relationships and clusters are other forms to watch for. 

Direction: If the relationship has a clear direction, there is either positive association (high values of the two 
variables tend to occur together) or negative association (high values of one variable tend to occur with low 
values of the other variable) 

Strength: The strength of a relationship is determined by how close the points in the scatterplot lie to a simple
form such as a line

5.2 Correlation



5.2.1 Pearson’s Correlation Coefficient 

Pearson’s Correlation Coefficient (r) measures the strength and direction of the linear association between two 
quantitative variables X and Y. Although correlation can be calculated for any scatterplot, Pearson’s correlation 
coefficient measures only straight-line relationships. 


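\[ r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \, \sum_i (Y_i - \bar{Y})^2}} \]

SAS Tip: PROC CORR (Use OUTP Option)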

Correlation indicates the direction of a linear relationship by its sign 
r > 0 for a positive association 
r < 0 for a negative association 

Correlation indicates the strength of a linear relationship by magnitude of its absolute value 
Correlation ranges from -1 to +1 

The Closer to -1, the Stronger the Negative Linear Relationship 
The Closer to +1, the Stronger the Positive Linear Relationship 
The Closer to 0, the Weaker Any Linear Relationship 

r = ±1 means perfect correlation, which occurs only when the points on a scatterplot lie exactly on a straight line
Correlation ignores the distinction between explanatory and response variables 
The value of r is not affected by changes in the unit of measurement of either variable 
Correlation is not resistant, so outliers can greatly change the value of r

Graphical Illustration

[Figure: graphical illustration of correlation (vertical axis: Y)]

Example

X     | Y     | X - Xmean | Y - Ymean | (X - Xmean)(Y - Ymean) | (X - Xmean)^2 | (Y - Ymean)^2
101.0 |  99.2 |    24.7   |   -35.3   |        -871.6          |     609.5     |    1246.5
100.1 |  99.0 |    23.8   |   -35.5   |        -844.6          |     565.9     |    1260.7
100.0 | 100.0 |    23.7   |   -34.5   |        -817.4          |     561.1     |    1190.7
 90.6 | 111.6 |    14.3   |   -22.9   |        -327.3          |     204.2     |     524.7
 86.5 | 122.2 |    10.2   |   -12.3   |        -125.4          |     103.8     |     151.4
 89.7 | 117.6 |    13.4   |   -16.9   |        -226.3          |     179.2     |     285.8
 90.6 | 121.1 |    14.3   |   -13.4   |        -191.5          |     204.2     |     179.7
 82.8 | 136.0 |     6.5   |     1.5   |           9.7          |      42.1     |       2.2
 70.1 | 154.2 |    -6.2   |    19.7   |        -122.3          |      38.6     |     387.9
 65.4 | 153.6 |   -10.9   |    19.1   |        -208.4          |     119.1     |     364.6
 61.3 | 158.5 |   -15.0   |    24.0   |        -360.2          |     225.4     |     575.7
 62.5 | 140.6 |   -13.8   |     6.1   |         -84.2          |     190.8     |      37.1
 63.6 | 136.2 |   -12.7   |     1.7   |         -21.5          |     161.6     |       2.9
 52.6 | 168.0 |   -23.7   |    33.5   |        -794.2          |     562.2     |    1121.9
 59.7 | 154.3 |   -16.6   |    19.8   |        -328.8          |     276.0     |     391.8
 59.5 | 149.0 |   -16.8   |    14.5   |        -243.7          |     282.6     |     210.1
 61.3 | 165.5 |   -15.0   |    31.0   |        -465.3          |     225.4     |     960.6
Sum   |       |           |           |       -6023.1          |    4551.5     |    8894.2

r = (-6023.1) / (4551.5 × 8894.2)^(1/2)
r = -0.95 (i.e. -95%)
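The worked example can be reproduced in Python as a cross-check. This is a minimal sketch using only the standard library; the function name `pearson_r` is an illustrative choice, and because the X and Y values are entered as displayed (rounded to one decimal place) the intermediate sums may differ slightly from the table.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient from deviations about the means."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# X and Y columns of the example, as displayed in the table.
x = [101.0, 100.1, 100.0, 90.6, 86.5, 89.7, 90.6, 82.8, 70.1,
     65.4, 61.3, 62.5, 63.6, 52.6, 59.7, 59.5, 61.3]
y = [99.2, 99.0, 100.0, 111.6, 122.2, 117.6, 121.1, 136.0, 154.2,
     153.6, 158.5, 140.6, 136.2, 168.0, 154.3, 149.0, 165.5]

r = pearson_r(x, y)
print(f"r = {r:.2f}")
```

Swapping the two arguments gives the same value, which reflects the fact that correlation ignores the distinction between explanatory and response variables.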


5.2.2. Spearman’s Rank Correlation Coefficient 

Spearman’s rank correlation coefficient (rho, ρ) is a non-parametric measure of statistical dependence between two
variables. It assesses how well the relationship between two variables can be described using a monotonic
function.

\[ \rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}, \quad \text{where } d_i = x_i - y_i \]

Here x_i and y_i are the two ranks given to observation i, and n is the number of observations.

SAS Tip: PROC CORR (Use OUTS Option)

Example:

Candidate | Judge A’s Ranking | Judge B’s Ranking |  d | d^2
A         |         1         |         2         | -1 |  1
B         |         2         |         1         |  1 |  1
C         |         3         |         4         | -1 |  1
D         |         4         |         3         |  1 |  1
E         |         5         |         6         | -1 |  1
F         |         6         |         7         | -1 |  1
G         |         7         |         5         |  2 |  4
Sum       |                   |                   |    | 10

ρ = 1 - [6(10) / 7(49 - 1)]
  = 82.14%
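The judges example can be checked with a short Python sketch. It assumes, as in the example, that the inputs are already ranks with no ties; the function name `spearman_rho` is an illustrative choice.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman's rho from two lists of ranks (assumes no tied ranks)."""
    n = len(ranks_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


judge_a = [1, 2, 3, 4, 5, 6, 7]   # Judge A's ranking of candidates A to G
judge_b = [2, 1, 4, 3, 6, 7, 5]   # Judge B's ranking of candidates A to G
rho = spearman_rho(judge_a, judge_b)
print(f"rho = {rho:.4f}")  # rho = 0.8214
```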
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A.1 List of Some Commonly Used Tests 


Statistical Tests

Investigation of significance of:

■ Difference between an assumed population mean μ0 and a sample mean x̄
  - When the population variance is known [ Z-test ]
  - When the population variance is unknown [ t-test ]
■ Difference between two sample means, one from each population
  - When the population variances are known and equal [ Z-test ]
  - When the population variances are known and unequal [ Z-test ]
  - When the population variances are unknown but equal [ t-test ]
  - When the population variances are unknown and unequal [ t-test ]
  - When observations for the two samples are obtained in pairs [ t-test ]
■ Difference between an assumed population proportion and an observed sample proportion [ Z-test ]
■ Difference between two sample proportions, one from each population [ Z-test ]
■ Difference between two counts [ Z-test ]
■ Difference between an assumed population variance and a sample variance [ χ²-test ]
■ Difference between two sample variances, one from each population [ F-test ]
■ A variable in Regression Model (i.e. difference between a regression coefficient and zero) [ t-test ]
■ Overall regression model [ F-test ]

A.2 Examples of Discrete Distributions



A.2.1. Discrete Uniform Distribution

[Figure: probability mass functions of the discrete uniform distribution for n = 4 and n = 5 (horizontal axis: x, vertical axis: probability)]

Probability Mass Function

\[ P(X = k) = \frac{1}{N}, \quad k = 1, 2, \ldots, N \]


Main Properties 

This distribution is used to model experimental outcomes which are “equally likely” 


Applications 

Tossing of a fair and unbiased die. For the given sample space {1,2, 3, 4, 5, 6}, each number occurs with a probability of 1/6 


A.2.2. Binomial Distribution

[Figure: binomial probability mass functions for n = 50 with p = 0.1, 0.3, 0.5, 0.7, 0.9, and for p = 0.5 with n = 10, 20, 30, 40, 50 (horizontal axis: x, vertical axis: probability)]

Probability Mass Function

\[ P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \ldots, n \]

n : Number of trials
p : Probability of success on a single trial

Main Properties 

Binomial distribution is used when there are exactly two mutually exclusive outcomes of a trial, labeled as ‘success’ and ‘failure’ 
Mean = n p and variance = n p (1 -p) 

The binomial distribution is probably the most commonly used discrete distribution 
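The mean and variance formulas can be verified numerically from the probability mass function. The sketch below uses n = 50 and p = 0.1, one of the parameter pairs plotted for this distribution; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial distribution with n trials and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 50, 0.1
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
# The probabilities sum to 1; mean and variance should match n*p and n*p*(1-p).
print(f"total = {sum(pmf):.4f}, mean = {mean:.4f}, variance = {variance:.4f}")
```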


Applications 

Customer Retention (Identification of Attrition Cases) 
Fraud Detection (Identification of Defaulters)

A.2.3. Hypergeometric Distribution

Probability Mass Function

\[ P(X = k) = \frac{\binom{M}{k}\binom{N - M}{n - k}}{\binom{N}{n}}, \quad L \le k \le U, \quad \text{where } L = \max\{0,\, M - N + n\} \text{ and } U = \min\{n,\, M\} \]

N : Number of items in population
M : Number of defective items in population
n : Number of items in a sample
k : Number of defective items in the sample

Main Properties


The Hypergeometric (n, M, N) distribution models the number of items of a particular type that are there in a sample of size n 
where that sample is drawn from a population of size N of which M are also of that particular type 


Mean = n (M / N) and Variance = n (M / N) [1 - (M / N)] [(N-n) / (N-1)] 
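A small numerical check of the mass function and of the mean and variance formulas is sketched below. The lot size, number of defectives and sample size (N = 20, M = 5, n = 4) are hypothetical values chosen only for illustration, and the function name `hypergeom_pmf` is an assumption.

```python
from math import comb


def hypergeom_pmf(k, N, M, n):
    """P(X = k): k defectives in a sample of n drawn without replacement
    from N items of which M are defective."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)


N, M, n = 20, 5, 4                               # hypothetical lot and sample
low, high = max(0, M - N + n), min(n, M)         # support limits L and U
pmf = {k: hypergeom_pmf(k, N, M, n) for k in range(low, high + 1)}
mean = sum(k * pk for k, pk in pmf.items())
variance = sum((k - mean) ** 2 * pk for k, pk in pmf.items())
print(f"mean = {mean:.4f}, variance = {variance:.4f}")
```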


Applications 

Sampling without replacement 


From a lot of N items with M defective pieces, what is the probability of getting k defective items by the customer if he purchases
n units?

A.2.4. Poisson Distribution

[Figure: Poisson probability mass functions for λ = 1, 5, 10, 15 and 20 (horizontal axis: x, vertical axis: probability)]

Probability Mass Function

\[ P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \ldots \]

λ : The shape parameter which indicates the average number of events in the given time interval

Main Properties

Poisson distribution is used to model the number of events occurring within a given time interval
Mean = Variance = λ

Poisson distribution is very commonly used to model count data 
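The property Mean = Variance = λ can be checked numerically. The sketch below assumes λ = 5 (one of the plotted values) and truncates the infinite sum at k = 59, where the remaining probability is negligible for this λ; the function name `poisson_pmf` is an illustrative choice.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)


lam = 5                                           # assumed average number of events
pmf = [poisson_pmf(k, lam) for k in range(60)]    # truncated support 0..59
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
print(f"mean = {mean:.4f}, variance = {variance:.4f}")
```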


Applications 

Count of phone calls arriving at a call centre per minute 

Count of insurance claims made by customers in a given period of time

A.2.5. Geometric Distribution

Probability Mass Function

\[ P(X = k) = (1 - p)^k \, p, \quad k = 0, 1, 2, \ldots \]

p : Probability of success on each trial

Main Properties

Geometric distribution is used for modeling the number of failures before the first success. It expresses the probability of exactly k
failures before the first success occurs or, equivalently, the probability that exactly (k + 1) trials are required to get the first success
Mean = (1 - p) / p and Variance = (1 - p) / p^2


Applications 

■ Number of phone calls required before making a sale

■ Number of dry wells an oil company will drill in a particular area before getting an oil-producing well
■ Number of proposals to get a ‘Yes’

A.2.6. Negative Binomial Distribution

Probability Mass Function

\[ P(X = k) = \binom{r + k - 1}{k} p^r (1 - p)^k, \quad k = 0, 1, 2, \ldots; \; 0 < p < 1 \]

p : Probability of success on each trial
r : Threshold number of successes
k : Number of failures until the r-th success

Main Properties

■ Negative Binomial distribution gives the probability of observing k failures before the r-th success or, equivalently, the probability that
k + r trials are required for the r-th success to occur
Mean = r (1 - p) / p and Variance = r (1 - p) / p^2

Applications 

Games (what’s the probability that a basketball player makes his 3rd free throw on his 5th shot?)

Marketing (what’s the probability that a door-to-door salesman sells the last candy bar at the 10th house?)

A.3 Examples of Continuous Distributions


A.3.1. Continuous Uniform Distribution

[Figure: Standard Uniform Distribution]

Main Properties

The uniform distribution defines equal probability over a given range for a continuous distribution 
‘a’ is the location parameter and ‘b-a’ is the scale parameter 
In case of a standard uniform distribution, a = 0 and b = 1 

Applications 

Generation of random numbers: Almost all random number generators generate random numbers on the (0,1) interval

A.3.2. Normal Distribution

Probability Density Function

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty,\; -\infty < \mu < \infty,\; \sigma > 0 \]

μ : Mean
σ : Standard Deviation

Main Properties

It is symmetrical around its mean value and is also called the ‘bell-shaped’ curve
68%, 95% and 99.7% of the area under the normal curve lies within μ ± σ, μ ± 2σ and μ ± 3σ respectively
■ Normal distribution with mean = 0 and variance = 1 is called the standard normal distribution


Standard Normal Distribution: μ = 0 and σ² = 1, with \( Z = \frac{X - \mu}{\sigma} \)


Central Limit Theorem: Let X_1, X_2, ... , X_n denote n independent random variables, all of which have the same PDF with mean = μ
and variance = σ². Let X-bar = ΣX_i / n. Then as n increases indefinitely (i.e. as n → ∞), the distribution of X-bar approaches the normal distribution
with mean = μ and variance = σ² / n
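The 68%, 95% and 99.7% figures quoted under Main Properties can be reproduced from the standard normal CDF. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `area_within` is an illustrative choice.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s) of the mean: {area_within(k):.4f}")
```

The three printed areas are approximately 0.6827, 0.9545 and 0.9973, which round to the percentages quoted above.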


Applications


Any process yielding values that tend to be symmetric around a central value (the mean) generally follows a normal distribution,
like height of individuals and marks of candidates in an entrance exam

A.3.3. Chi-Square (χ²) Distribution

Probability Density Function

\[ f(x) = \frac{1}{2^{k/2}\,\Gamma(k/2)} \, x^{\frac{k}{2} - 1} e^{-\frac{x}{2}}, \quad x > 0,\; k > 0 \]

Γ is the Gamma Function : \( \Gamma(a) = \int_0^\infty t^{a-1} e^{-t}\,dt \)
k : Degrees of freedom

Main Properties

If Z_1, Z_2, ... , Z_k are independent standardized normal variables, then Z = ΣZ_i² is said to possess the χ² distribution with df = k
The χ² distribution is a skewed distribution, the degree of skewness depending on the df. For comparatively few df, the
distribution is highly skewed to the right; but as the df increases, the distribution becomes increasingly symmetrical
Mean = k and Variance = 2k



Applications 

Hypothesis Testing (e.g. to test if a sample variance is equal to an assumed population variance)

A.3.4. F Distribution

m : DF of χ² random variable in the numerator
n : DF of χ² random variable in the denominator

Main Properties

If X_1 and X_2 are independently distributed chi-square variables with m and n df respectively, then F = [(X_1 / m) / (X_2 / n)] follows
(Fisher’s) F distribution with m and n df

Like the χ² distribution, the F distribution is skewed to the right. But as m and n become large, the F distribution approaches the normal distribution


Applications 

Hypothesis Testing (e.g. Test of equality of variance of two populations, ANOVA test etc.)

A.3.5. Student’s t Distribution

Main Properties

Let Z and S be independent random variables such that Z ~ N(0,1) and n S² ~ χ²_n. The distribution of t = Z / S is called Student’s
t distribution with df = n. As df increases, the t distribution approaches the normal distribution.

Mean = 0 for n > 1, Variance = n / (n - 2) for n > 2, Median = 0, Skewness = 0


Applications 

Hypothesis Testing (e.g. Estimation of population parameters when sample size is small or when population variance is unknown)

A.3.6. Exponential Distribution

[Figure: Exponential Distribution]

Probability Density Function

\[ f(x) = \lambda e^{-\lambda x}, \quad x > 0,\; \lambda > 0 \]


Main Properties

The exponential distribution is used to model a random variable with mean 1 / λ that represents the waiting time until the first
event occurs, where events are generated by a Poisson process with mean λ (i.e. events occur continuously and independently
at a constant average rate λ)

Mean = 1 / λ and Variance = 1 / λ²

In case of a standard exponential distribution, λ = 1


Applications 

The exponential distribution is primarily used in reliability applications

A.3.7. Gamma Distribution

Probability Density Function

\[ f(x) = \frac{\lambda^a x^{a-1} e^{-\lambda x}}{\Gamma(a)}, \quad x > 0,\; \lambda > 0,\; a > 0 \]

Main Properties

The gamma distribution can be viewed as a generalization of the exponential distribution. The gamma random variable
represents the waiting time until the a-th event occurs.

Mean = a / λ and Variance = a / λ²


Applications 

For modeling size of insurance claims
For modeling size of rainfalls

A.3.8. Beta Distribution

Probability Density Function

\[ f(x) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)} \, x^{a-1} (1 - x)^{b-1}, \quad 0 < x < 1,\; a > 0,\; b > 0 \]

Main Properties

■ a and b are shape parameters. The density curve is
U-shaped when a < 1 and b < 1
Symmetric about 0.5 when a = b > 1
J-shaped when (a - 1)(b - 1) < 0
Unimodal for all other values of a and b
Mean = a / (a + b) and Variance = a b / [(a + b)² (a + b + 1)]

Applications 

For modeling events which are constrained to take place within an interval defined by a minimum and maximum value

For queries, contact Varun Aggarwal at Varun.Aaaarwal@exlservice.com